Data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE157938 In that study, IMPAIRED SUCROSE INDUCTION 1 (ISI1) was identified as a player in long distance wound signaling in Arabidopsis. To further assess the impact of ISI1 on the wound response, authors tested the transcriptomes of the wounded isi1-2 mutants in comparison with wounded wild type plants, in order to find genes that were deregulated in isi1 loss-of-function mutants
Two groups of plants: control (Col-0) and isi-mutant plants (isi1-2) with three biological replicates per group; Distal leaves 13 were collected one hour after wounding leave 8 (in wounded plants) or without any treatment (unwounded plants). Two individual plants were pooled for one replicate; single read design.
Set your working directory with Kallisto results
Checking for required packages and install, if necessary
library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(tximport)
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:lubridate’:
intersect, setdiff, union
The following objects are masked from ‘package:dplyr’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter,
Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, table, tapply, union, unique, unsplit, which.max,
which.min
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:lubridate’:
second, second<-
The following objects are masked from ‘package:dplyr’:
first, rename
The following object is masked from ‘package:tidyr’:
expand
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Attaching package: ‘IRanges’
The following object is masked from ‘package:lubridate’:
%within%
The following objects are masked from ‘package:dplyr’:
collapse, desc, slice
The following object is masked from ‘package:purrr’:
reduce
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following object is masked from ‘package:dplyr’:
count
Attaching package: ‘MatrixGenerics’
The following objects are masked from ‘package:matrixStats’:
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins,
colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs,
colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs,
rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs,
rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds,
rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads,
rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")',
and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
library(pheatmap)
library(RColorBrewer)
library(rhdf5)
library(WGCNA)
Loading required package: dynamicTreeCut
Loading required package: fastcluster
Attaching package: ‘fastcluster’
The following object is masked from ‘package:stats’:
hclust
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘WGCNA’
The following object is masked from ‘package:IRanges’:
cor
The following object is masked from ‘package:S4Vectors’:
cor
The following object is masked from ‘package:stats’:
cor
library(clusterProfiler)
clusterProfiler v4.12.2 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
Please cite:
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to
characterize multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z
Attaching package: ‘clusterProfiler’
The following object is masked from ‘package:IRanges’:
slice
The following object is masked from ‘package:S4Vectors’:
rename
The following object is masked from ‘package:purrr’:
simplify
The following object is masked from ‘package:stats’:
filter
library(org.At.tair.db)
Loading required package: AnnotationDbi
Attaching package: ‘AnnotationDbi’
The following object is masked from ‘package:clusterProfiler’:
select
The following object is masked from ‘package:dplyr’:
select
library(ggridges)
samples <- read_tsv("phenotable.tsv", show_col_types = FALSE)
samples
files <- file.path(samples$path, "abundance.h5") # path to abundance files in kallisto dirs
files
[1] "SRR12649393_1/abundance.h5" "SRR12649394_1/abundance.h5" "SRR12649395_1/abundance.h5" "SRR12649396_1/abundance.h5"
[5] "SRR12649397_1/abundance.h5" "SRR12649398_1/abundance.h5" "SRR12649399_1/abundance.h5" "SRR12649400_1/abundance.h5"
[9] "SRR12649401_1/abundance.h5" "SRR12649402_1/abundance.h5" "SRR12649403_1/abundance.h5" "SRR12649404_1/abundance.h5"
The data is from Kallisto, a tool for quantifying RNA-seq data at the transcript level. • tximport: Imports transcript-level quantification data from Kallisto. • DESeqDataSetFromTximport: Converts the imported data into a DESeq2 dataset, setting up for differential expression analysis. • ddsTxi: Contains the DESeq2 dataset object which includes the count data, sample metadata, and design formula.
txi <- tximport(files, type = 'kallisto', txOut = T)
1 2 3 4 5 6 7 8 9 10 11 12
ddsTxi <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ genotype*condition)
Warning in DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
using counts and average transcript lengths from tximport
ddsTxi
class: DESeqDataSet
dim: 48359 12
metadata(1): version
assays(2): counts avgTxLength
rownames(48359): AT1G76820.1 ATMG00060.1 ... AT1G71740.1 AT3G05780.1
rowData names(0):
colnames: NULL
colData names(4): sample genotype condition path
Filter Low-Expression Genes and relevel the genotype Factor The code keeps only those genes where the sum of counts across all samples is at least 10 ‘relevel’ changes the reference level of the genotype factor to ensures that all comparisons of genotype effects are made relative to the “WT” genotype.
dds <- ddsTxi[rowSums(counts(ddsTxi)) >= 10,]
dds$genotype <- relevel(dds$genotype, ref = "WT")
dds <- DESeq(dds, fitType = 'local')
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
dds
class: DESeqDataSet
dim: 37465 12
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(37465): AT1G76820.1 ATMG00060.1 ... AT2G27490.2 AT1G71740.1
rowData names(30): baseMean baseVar ... deviance maxCooks
colnames: NULL
colData names(4): sample genotype condition path
Variance-stabilizing transformation of data The VST aims to stabilize the variance across the range of mean expression values, making the data more suitable for downstream analyses such as clustering or principal component analysis (PCA). blind = FALSE: This argument controls whether the transformation should be performed without accounting for the experimental design. Setting blind = FALSE means that the transformation will use the information from the experimental design (i.e., it will not “blind” to the design), which is often preferred when you are interested in preserving the relationship between conditions in the transformed data. fitType = ‘local’: Specifies the method for estimating the dispersion used in the transformation. The ‘local’ option fits a local smoothing spline to the dispersion estimates, which can provide a better fit for datasets with varying dispersion values. (it can run without any fit type and set it automatically)
vsd <- vst(dds, blind = FALSE, fitType = 'local')
vsd
class: DESeqTransform
dim: 37465 12
metadata(1): version
assays(1): ''
rownames(37465): AT1G76820.1 ATMG00060.1 ... AT2G27490.2 AT1G71740.1
rowData names(30): baseMean baseVar ... maxCooks dispFit
colnames(12): 1 2 ... 11 12
colData names(4): sample genotype condition path
Let’s look at the results:
res <- results(dds)
res
log2 fold change (MLE): genotypeISI.conditionW
Wald test p-value: genotypeISI.conditionW
DataFrame with 37465 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
AT1G76820.1 175.4066 0.0917138 0.2500738 0.366747 0.713808 0.987831
ATMG00060.1 12.0979 0.2923877 1.0995924 0.265906 0.790312 0.995532
AT4G16360.1 762.0852 -0.0319303 0.0806895 -0.395718 0.692313 0.985234
AT5G26800.1 299.0979 -0.2561651 0.1766372 -1.450234 0.146993 0.699143
AT4G16110.1 209.2414 0.0712498 0.2337945 0.304754 0.760554 0.993554
... ... ... ... ... ... ...
AT1G24405.1 8.52672 -0.6597421 0.931364 -0.7083612 0.478721 0.936659
AT2G27490.4 102.54183 0.2235853 0.368364 0.6069676 0.543872 0.956735
AT2G27490.1 17.19200 0.3401511 1.082508 0.3142250 0.753350 0.992456
AT2G27490.2 66.26305 0.0305085 0.514765 0.0592669 0.952740 0.999397
AT1G71740.1 2.19925 1.3243985 1.680299 0.7881923 0.430584 NA
• Base Mean provides the average expression level.
• Log2 Fold Change indicates how the expression changes between conditions.
• LfcSE gives the variability in the log2 fold change estimate.
• Stat is the Wald statistic used in hypothesis testing.
And how many of these results are significant?
res005 <- results(dds, alpha=0.05)
summary(res005)
out of 37465 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 235, 0.63%
LFC < 0 (down) : 257, 0.69%
outliers [1] : 367, 0.98%
low counts [2] : 2180, 5.8%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
This means that out of 37,465 genes tested, 3,458 were identified as significant with a padj less than 0.05. The remaining genes did not pass this threshold and are considered insignificant.
Let’s look on names of our variables:
resultsNames(dds)
[1] "Intercept" "genotype_ISI_vs_WT" "condition_W_vs_UW" "genotypeISI.conditionW"
Intercept = genotypeWT.conditionUW. In this model, the intercept represents the baseline gene expression for the reference levels of both genotype and condition. Let’s arranged it by log2FoldChange:
order_indices <- order(-res$log2FoldChange)
res_sorted <- res[order_indices, ]
res_sorted
log2 fold change (MLE): genotypeISI.conditionW
Wald test p-value: genotypeISI.conditionW
DataFrame with 37465 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
AT1G10430.1 25.68831 30 6.00876 4.99271 NA NA
AT5G03540.3 14.06154 30 6.01269 4.98944 NA NA
AT4G33520.3 5.77803 30 6.03077 4.97449 NA NA
AT1G35220.5 8.60384 30 6.01359 4.98870 NA NA
AT4G34490.2 22.75369 30 6.00918 4.99236 NA NA
... ... ... ... ... ... ...
AT3G49420.3 8.29872 -30 6.01296 -4.98922 NA NA
AT3G07740.4 5.05402 -30 6.01742 -4.98552 6.17942e-07 0.000338702
AT1G76350.3 5.31241 -30 6.01653 -4.98626 6.15596e-07 0.000338702
AT4G21150.2 10.23504 -30 6.01242 -4.98967 NA NA
AT1G80000.2 28.25891 -30 6.00807 -4.99329 5.93603e-07 0.000338702
Visualisation for the first gene
# Gene counts for a specific gene
plotCounts(dds, gene="AT1G10430.1", intgroup="genotype")
plotCounts(dds, gene="AT1G10430.1", intgroup="condition")
Volkano plot
top_genes <- res %>%
as.data.frame() %>%
filter(!is.na(log2FoldChange), !is.na(padj), padj < 0.05, log2FoldChange >= -2.5, log2FoldChange <= 2.5) %>%
arrange(desc(abs(log2FoldChange))) %>%
head(10) %>%
mutate(label = rownames(.))
ggplot(res %>%
as.data.frame() %>%
filter(!is.na(log2FoldChange), !is.na(padj), log2FoldChange >= -2.5, log2FoldChange <= 2.5),
aes(log2FoldChange, -log10(padj), color = padj < 0.05)) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
xlim(c(-2.5, 2.5)) +
geom_text(data = top_genes, aes(label = label), size = 2, vjust = -1, hjust = 1) +
theme_minimal() +
labs(color = "Adjusted p-value")
** PCA plot **
plotPCA(vsd, intgroup=c("genotype", "condition"))
using ntop=500 top features by variance
Ideally, each group of replicates should cluster together in a PCA
plot, reflecting similar gene expression profiles within each group. -
Outlier in the ISI:UW Group: The fact that
one replicate from the ISI:UW group clusters closer to the
WT:UW group could suggest that the ISI1 loss-of-function
mutation was not fully penetrant or did not manifest as strongly in that
particular replicate. As a result, its gene expression profile is more
similar to the wild-type unwounded (WT:UW) condition rather
than to its intended group (ISI:UW).
This could occur if the mutation in ISI1 is variable in its effect or if there was some technical or biological reason why this replicate did not exhibit the expected mutant phenotype.
Clear Division Between W and UW
Groups: The distinct separation of the wounded (W)
and unwounded (UW) samples on the PCA plot indicates that
the wounding has a strong and consistent effect on gene expression,
regardless of genotype.
** MA plot **
Under a null hypothesis of no differential expression, most genes’ log2 fold changes should center tightly around 0. However, genes with higher absolute M values exhibit greater differential expression between conditions. An ideal MA plot will show the majority of genes along M=0 and outliers stretching towards the graph peripheries.
plotMA(dds)
Here the difference in expression of the entire model of differential expression is shown and we see that there are not many differentially expressed genes and some groups are noticeable in the graph. Let’s divide this graph by conditions and look at it separately. Wounded VS Unwouned
res <- results(dds, contrast=c("condition", "W", "UW"))
plotMA(res)
This plot looks more like an ideal MA graph with a large number of differentially expressed genes. What about ISI1-mutated VS WT?
res <- results(dds, contrast=c("genotype", "ISI", "WT"))
plotMA(res)
This plot appears to be close with the null hypothesis, and reveals small effect of the genotype. Plot a heatmap of 100 most expressed genes
select <- order(rowMeans(counts(dds,normalized=TRUE)),
decreasing=TRUE)[1:100]
df <- as.data.frame(colData(dds)[,c("genotype", "condition")])
pheatmap(assay(vsd)[select,],
cluster_rows = TRUE,
show_rownames = TRUE,
cluster_cols = TRUE,
annotation_col = df,
fontsize_row = 6) # Adjust fontsize_row to your preference
We have a distinct separation of the wounded (W) and
unwounded (UW) samples again, which indicates that the
wounding has a strong and consistent effect on gene expression,
regardless of genotype.
Plot of the distance between samples heatmap
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$genotype, vsd$condition, sep="-")
colnames(sampleDistMatrix) <- paste(vsd$genotype, vsd$condition, sep="-")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
color = colors)
Same separation of the wounded (W) and unwounded
(UW) samples again
res
log2 fold change (MLE): genotype ISI vs WT
Wald test p-value: genotype ISI vs WT
DataFrame with 37465 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
AT1G76820.1 175.4066 -0.2649083 0.1737217 -1.524900 0.12728393 0.757109
ATMG00060.1 12.0979 0.8078983 0.7976565 1.012840 0.31113669 0.930262
AT4G16360.1 762.0852 0.1453457 0.0553858 2.624239 0.00868427 0.253548
AT5G26800.1 299.0979 -0.0724616 0.1257496 -0.576237 0.56445501 0.999948
AT4G16110.1 209.2414 -0.0640335 0.1652295 -0.387543 0.69835422 0.999948
... ... ... ... ... ... ...
AT1G24405.1 8.52672 0.396784 0.678618 0.584695 0.558753 0.999948
AT2G27490.4 102.54183 -0.265294 0.263356 -1.007359 0.313762 0.931444
AT2G27490.1 17.19200 -0.436321 0.776040 -0.562240 0.573952 0.999948
AT2G27490.2 66.26305 0.313981 0.366080 0.857685 0.391067 0.970626
AT1G71740.1 2.19925 -1.849487 1.234794 -1.497810 0.134183 NA
summary(res)
out of 37465 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 346, 0.92%
LFC < 0 (down) : 217, 0.58%
outliers [1] : 367, 0.98%
low counts [2] : 2906, 7.8%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
gene_ids <- rownames(res)
head(gene_ids, 99)
[1] "AT1G76820.1" "ATMG00060.1" "AT4G16360.1" "AT5G26800.1" "AT4G16110.1" "AT4G15130.1" "ATMG01320.1" "AT3G57180.1" "AT2G06210.1"
[10] "AT1G60260.1" "AT1G64460.1" "AT5G18710.1" "AT5G60900.1" "AT4G15950.1" "AT3G08943.1" "AT3G52310.1" "AT4G16150.1" "AT3G55270.1"
[19] "AT1G60930.1" "AT4G35335.1" "AT5G11100.1" "AT1G18320.1" "AT5G16970.1" "AT2G43120.2" "AT2G43120.1" "AT3G18710.1" "AT4G25880.4"
[28] "AT4G25880.2" "AT4G25880.5" "AT4G25880.1" "AT4G25880.3" "AT1G78930.1" "AT3G07950.1" "AT4G03450.2" "AT4G03450.1" "AT1G71695.1"
[37] "AT1G58983.1" "AT1G12980.1" "AT5G41480.1" "AT3G51870.1" "AT4G29950.3" "AT4G29950.1" "AT4G29950.2" "AT3G18310.1" "AT5G07360.3"
[46] "AT5G07360.1" "AT5G07360.2" "AT4G08535.1" "AT4G08535.2" "AT4G26490.1" "AT5G46240.1" "AT5G62260.3" "AT5G62260.1" "AT4G16955.1"
[55] "AT3G25717.1" "AT5G41908.1" "AT2G21390.1" "AT3G48330.2" "AT3G48330.3" "AT3G48330.4" "AT3G48330.1" "AT4G32710.1" "AT3G46440.1"
[64] "AT3G46440.2" "AT1G56250.1" "AT3G16830.1" "AT5G04950.1" "AT4G13720.1" "AT4G13720.2" "AT5G50110.1" "AT5G50110.2" "AT3G03305.1"
[73] "AT2G38310.1" "AT2G26900.1" "AT1G69810.1" "AT1G72450.2" "AT1G72450.1" "AT5G03560.3" "AT5G03560.4" "AT5G03560.1" "AT5G03560.2"
[82] "AT4G27340.3" "AT4G27340.2" "AT4G27340.1" "AT1G76280.4" "AT1G76280.2" "AT1G76280.5" "AT1G76280.1" "AT3G03272.1" "AT3G28960.1"
[91] "AT3G28960.4" "AT3G28960.3" "AT3G28960.2" "AT3G28960.5" "AT3G10210.1" "AT4G27830.2" "AT4G27830.1" "AT5G46260.1" "AT2G32710.2"
Understanding the Gene Identifier “AT1G76820.1”
The nomenclature “AT1G76820.1” is commonly used in genomic data to denote gene identifiers in various databases. In this case, it is a nomenclature from the TAIR (The Arabidopsis Information Resource) database. Here is a breakdown of this identifier:
1. AT: Indicates that the identifier refers to the plant Arabidopsis thaliana (always specified when referring to Arabidopsis).
2. 1: Refers to the chromosome on which the gene is located. In this case, it is the first chromosome.
3. G76820: This is a unique gene number on this chromosome.
4. .1: Version of the gene annotation. In this case, it is the first version of the annotation for this gene. Sometimes, additional versions are indicated as .2, .3, etc., if the annotation has been updated.
Filtering out only the significant results
sign_results <- res %>%
as.data.frame %>%
rownames_to_column("gene_name") %>%
mutate(gene_name = gsub("\\.\\d+$", "", gene_name)) %>%
filter(padj < .05)
sign_results
Genes that increase and decrease expression are considered separately
sign_up <- sign_results %>% filter(log2FoldChange > 0)
sign_dw <- sign_results %>% filter(log2FoldChange < 0)
Calculating GO enrichment
This function performs Gene Ontology (GO) enrichment analysis. It tests whether the genes in your input list are overrepresented in specific GO categories compared to a background gene set. The plot will display the top 20 Biological Process (BP) GO terms that are significantly enriched in your list of genes. Each dot represents a GO term, with the size and color typically indicating the significance of the enrichment.
GO_enrich <- enrichGO(sign_up$gene_name, "org.At.tair.db", keyType = "TAIR", ont = "BP")
plt <- dotplot(GO_enrich, showCategory = 20)
plt + theme(
axis.text.y = element_text(size = 6)
)
The top GO terms suggest that your gene set is significantly associated with processes related to amino acid metabolism, response to stress signals like jasmonic acid, and various biosynthetic pathways.
GO enricment table
GO_enrich@result
GO_enrich <- enrichplot::pairwise_termsim(GO_enrich)
plt <- emapplot(GO_enrich,
repel = TRUE,
showCategory = 12,
cluster.params = list(label_style = list(size = 3)))
plt
goplot(GO_enrich)
Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps
GSEA is a method used in genomics to determine whether a set of genes (a gene set) shows statistically significant, concordant differences between two biological states.
How It Works?
1. **Ranking Genes:** FGSEA begins by ranking genes based on their association with a particular phenotype or condition. This is often done using metrics such as fold change or t-statistics from differential expression analysis.
2. **Enrichment Score Calculation:** It calculates an enrichment score for each gene set by examining how the genes in the set are distributed within the ranked list. It uses a fast algorithm to estimate the enrichment score, which reflects the concentration of genes from the set at the top or bottom of the ranked list.
3. **Statistical Significance:** FGSEA performs a permutation test to assess the statistical significance of the enrichment scores, providing p-values that indicate whether the observed enrichment is greater than what would be expected by chance.
Downloading .gmt file
pathway <- fgsea::gmtPathways('wikipathways-20240410-gmt-Arabidopsis_thaliana.gmt')
Creating ranks and a dictionary for translating rank names
translated_names <- res %>%
as.data.frame() %>%
na.omit() %>%
rownames_to_column("gene_name") %>%
mutate(gene_name = gsub("\\.\\d+$", "", gene_name)) %>%
pull(gene_name) %>%
bitr('TAIR', c("GENENAME", "ENTREZID"), 'org.At.tair.db') %>%
distinct(TAIR, .keep_all = T)
'select()' returned 1:many mapping between keys and columns
Warning in bitr(., "TAIR", c("GENENAME", "ENTREZID"), "org.At.tair.db") :
68.98% of input gene IDs are fail to map...
Сreating a rank vector
ranks_for_gsea <- res %>%
as.data.frame() %>%
na.omit() %>%
arrange(desc(stat)) %>%
rownames_to_column("gene_name") %>%
mutate(TAIR = gsub("\\.\\d+$", "", gene_name)) %>%
left_join(translated_names) %>%
mutate(NAME = ifelse(is.na(ENTREZID), TAIR, ENTREZID)) %>%
transmute(NAME, stat) %>%
deframe()
Joining with `by = join_by(TAIR)`
head(ranks_for_gsea, 100)
AT2G42640 AT1G07050 827377 827377 AT2G27810 AT4G31980 843133 AT1G17665 AT3G07090 AT5G59830 832706 AT4G38250 AT5G48250
20.246724 11.120257 8.986663 8.053232 7.475392 7.469084 7.378915 7.096614 6.665295 6.529442 6.291401 6.282042 6.268846
835868 817745 AT2G29340 AT1G04620 836062 AT4G19520 AT5G23240 826716 837158 AT1G01240 844228 826983 AT2G43850
6.088399 6.006504 5.921277 5.909854 5.906834 5.861006 5.789372 5.784426 5.712031 5.493775 5.491014 5.484837 5.410686
AT1G76790 AT5G05670 AT4G30310 AT5G35735 820771 AT5G38420 AT2G21530 831059 841749 AT4G03820 836034 AT2G43945 828419
5.390705 5.348861 5.346896 5.293781 5.256045 5.247263 5.237702 5.234711 5.211179 5.206994 5.169796 5.141525 5.126750
AT4G12830 839924 835567 AT3G28460 817513 AT1G43670 AT5G09850 AT4G00165 AT4G16146 AT2G24600 838591 825880 AT5G53220
5.124072 5.116142 5.110733 5.104006 5.058099 5.048352 5.037742 5.000387 4.983393 4.921529 4.902252 4.886903 4.874522
AT3G63160 814915 AT2G37020 AT5G43880 830548 814835 AT1G61430 AT4G02530 AT5G01660 AT3G22790 824525 AT1G25150 831201
4.872462 4.871772 4.867801 4.835534 4.833661 4.812673 4.802786 4.799946 4.796065 4.774055 4.768328 4.762493 4.761245
AT4G18240 819244 AT2G33490 825503 AT1G23040 AT5G38460 842157 822537 AT3G47800 828777 AT5G02950 AT1G17610 819989
4.749131 4.737943 4.709552 4.662044 4.658823 4.632480 4.628964 4.597351 4.573431 4.567163 4.552039 4.539813 4.533890
AT5G50565 824661 839921 AT2G27080 AT4G31880 AT5G19140 AT5G62720 AT4G17100 817112 AT5G04460 838301 AT2G30480 AT4G26670
4.520138 4.511153 4.475071 4.470800 4.448872 4.447919 4.418118 4.373725 4.371476 4.358213 4.356436 4.353084 4.349306
AT2G22980 AT3G03770 AT2G19930 818723 AT4G32340 AT3G48740 AT5G62360 835938 AT1G12750
4.338115 4.336853 4.336747 4.333774 4.318183 4.308816 4.303773 4.301866 4.295312
FGSEA calculating
fgsea_results <- fgsea::fgseaMultilevel(pathway, ranks_for_gsea)
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (0.09% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are duplicate gene names, fgsea may produce unexpected results.
head(fgsea_results[order(pval), ])
GSEA plot for one path
fgsea::plotEnrichment(pathway[["Seed development%WikiPathways_20240410%WP2279%Arabidopsis thaliana"]], ranks_for_gsea) + labs(title="Seed development")
Drawing the top 10 most enriched paths from the top and bottom of the ranked list
topPathwaysUp <- fgsea_results[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgsea_results[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
fgsea::plotGseaTable(pathway[topPathways], ranks_for_gsea, fgsea_results,
gseaParam=0.5)
FGSEA by clusterProfiler
GO_gsea <- gseGO(ranks_for_gsea, ont = "ALL", 'org.At.tair.db', eps = 0)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (0.09% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are duplicate gene names, fgsea may produce unexpected results.
leading edge analysis...
done...
Ridge plot
plt <- ridgeplot(GO_gsea, showCategory = 20)
plt + theme(
axis.text.y = element_text(size = 8)
)
Picking joint bandwidth of 0.248
Some of the paths can be related to plant wounding and how plants respond to injury or stress:
gseaplot(GO_gsea, 1, title = GO_gsea@result$Description[[2]])
datExpr <- t(assay(vsd))
rownames(datExpr) <- colData(vsd)$sample
Clustering samples
sampletree <- hclust(dist(datExpr), method = "average")
plot(sampletree, cex = 0.5)
# Pick powers for coexpression network construction
powers <- c(c(1:10), seq(from = 15, to=50, by=5))
sft <- pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
pickSoftThreshold: will use block size 1194.
pickSoftThreshold: calculating connectivity for given powers...
..working on genes 1 through 1194 of 37465
Warning: executing %dopar% sequentially: no parallel backend registered
..working on genes 1195 through 2388 of 37465
..working on genes 2389 through 3582 of 37465
..working on genes 3583 through 4776 of 37465
..working on genes 4777 through 5970 of 37465
..working on genes 5971 through 7164 of 37465
..working on genes 7165 through 8358 of 37465
..working on genes 8359 through 9552 of 37465
..working on genes 9553 through 10746 of 37465
..working on genes 10747 through 11940 of 37465
..working on genes 11941 through 13134 of 37465
..working on genes 13135 through 14328 of 37465
..working on genes 14329 through 15522 of 37465
..working on genes 15523 through 16716 of 37465
..working on genes 16717 through 17910 of 37465
..working on genes 17911 through 19104 of 37465
..working on genes 19105 through 20298 of 37465
..working on genes 20299 through 21492 of 37465
..working on genes 21493 through 22686 of 37465
..working on genes 22687 through 23880 of 37465
..working on genes 23881 through 25074 of 37465
..working on genes 25075 through 26268 of 37465
..working on genes 26269 through 27462 of 37465
..working on genes 27463 through 28656 of 37465
..working on genes 28657 through 29850 of 37465
..working on genes 29851 through 31044 of 37465
..working on genes 31045 through 32238 of 37465
..working on genes 32239 through 33432 of 37465
..working on genes 33433 through 34626 of 37465
..working on genes 34627 through 35820 of 37465
..working on genes 35821 through 37014 of 37465
..working on genes 37015 through 37465 of 37465
# Plot results
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
Claster dendrogram
# construct the network
net <- blockwiseModules(datExpr, power = 7,
TOMType = "unsigned", minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = FALSE,
saveTOMFileBase = "yeastTOM",
verbose = 3)
Calculating module eigengenes block-wise from all genes
Flagging genes and samples with too many missing values...
..step 1
....pre-clustering genes to determine blocks..
Projective K-means:
..k-means clustering..
..merging smaller clusters...
Block sizes:
gBlocks
1 2 3 4 5 6 7 8 9
4993 4980 4973 4728 4529 3866 3824 3055 2517
..Working on block 1 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..removing 1 genes from module 5 because their KME is too low.
..Working on block 2 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..Working on block 3 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..Working on block 4 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..Working on block 5 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..removing 2 genes from module 3 because their KME is too low.
..Working on block 6 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..removing 2 genes from module 6 because their KME is too low.
..Working on block 7 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..Working on block 8 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..Working on block 9 .
TOM calculation: adjacency..
..will not use multithreading.
Fraction of slow calculations: 0.000000
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
....clustering..
....detecting modules..
....calculating module eigengenes..
....checking kME in modules..
..merging modules that are too close..
mergeCloseModules: Merging modules whose distance is less than 0.25
Calculating new MEs...
module_colors <- labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], module_colors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
Module behavior across the dataset
moduleLabels <- net$colors
moduleColors <- labels2colors(net$colors)
# Extract module eigengenes
MEs <- net$MEs
geneTree <- net$dendrograms[[1]]
MEs0 <- moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs <- orderMEs(MEs0)
# Plot module behavior across the dataset
MEs %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
ggplot(aes(sample, MEblue))+
geom_point()+
theme_bw()
Are modules enriched in something?
yellow_module <- colnames(datExpr)[module_colors=="yellow"] %>% str_replace("_mRNA", "")
red_module <- colnames(datExpr)[module_colors=="red"] %>% str_replace("_mRNA", "")
# Are modules enriched in something?
GO_enrich <- enrichGO(sign_up$gene_name, "org.At.tair.db", keyType = "TAIR", ont = "BP")
barplot(GO_enrich)